Telecommunication System and Method

The invention relates to a telecommunication system and a telecommunication
method.

When someone is talking on the phone in a noisy environment, the phone's
microphone picks up not only the speaker's voice but also interfering sounds, both
of which are converted and transmitted to the remote party, that is, the receiver.
The more interfering sounds there are, the more the understandability of the
speaker to the remote party is reduced.

To overcome this problem, there are solutions, especially with current mobile
telephones, wherein noise reduction algorithms are applied to the received
microphone signal to improve the quality of the sound received at the remote end.

Such algorithms make use of the different spectral regions of environmental
noise and of voice. However, such current noise reduction cannot cope well with
environments such as noisy crowds, bars and so on, wherein the surrounding
interfering noise has a very similar spectrum and loudness to the useful signal,
the voice, since in such environments the sound comprises a lot of persons'
voices.

A speaker tries to overcome this problem by moving closer to the microphone and
by speaking louder. This is very often not very successful.

Furthermore, video telephony applications have recently attracted more interest.
In these applications the microphone is typically held at a viewing distance from
the speaker's face and is thus farther away from the speaker's mouth. Thus, the
signal to interference ratio is reduced even further.

It is an object of the present invention to enhance the signal to interference ratio even 
in very noisy environments. 

The invention is set forth in the independent claims. 

The general idea is to use a lip reading process to provide additional information
within a noise reduction process. Such a lip reading process allows features to be
extracted and used merely by detecting and analysing the lips position of the
speaker. Based on said analysis, that is, a lip reading result, the noise reduction
algorithm can use additional estimates of speech energy, or hints for collecting
statistics of the speaker's voice, to better separate the speaker's voice from the
surrounding sound comprising surrounding noise and surrounding persons' voices.

As in video telephony situations a camera is built in, said built-in camera can be
used for detecting the lips position of the speaker, and thus only the analysis and
separation need additionally be done. The invention may also be used in video
conference systems, in which very often several persons are sitting in one room
with separate cameras and microphones, but may nevertheless disturb and interfere
with the microphones of the other persons.

Thus, the main advantage is that noise reduction in a phone situation can be 
enhanced in crowded situations. 

Main and additional features of the present invention are:

a) visual information, picked up by a built-in camera in the phone, is added to audio
information picked up by a microphone, to form a speech signal that is 
transmitted to the remote party, 

b) in a typical video telephony use case, the image which is already obtained in order 
to be transmitted to the remote party can also be used for the purpose described 
here, 

c) real-time image processing is applied to the camera image in order to extract a
small number of relevant features of the speaker's mouth, 

d) the mouth features are further processed to provide input to the audio noise 
reduction algorithm. This could be e.g. the opening of the mouth as an estimate 
of speech energy, rapid movement of the mouth as a hint to consonants, etc.,

e) the noise reduction algorithm is extended to make use of features of the speaker's 
mouth obtained visually. 
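
Features c) and d) can be illustrated by the following minimal sketch. It is not part of the original disclosure: the `MouthFrame` type, the normalised `opening` value and the `fast_move` threshold are assumptions chosen for illustration only, and the image processing that would measure the mouth opening is taken as given.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MouthFrame:
    """Hypothetical per-video-frame mouth measurement (assumed, normalised)."""
    opening: float  # vertical lip distance, 0.0 (closed) to 1.0 (wide open)


def visual_hints(frames: List[MouthFrame],
                 fast_move: float = 0.3) -> List[Tuple[float, bool]]:
    """Map mouth features to hints for the audio noise reduction.

    Returns one (energy_estimate, consonant_hint) pair per frame:
    - energy_estimate: the mouth opening itself, used as a crude proxy
      for speech energy;
    - consonant_hint: True when the opening changed by more than
      `fast_move` since the previous frame (rapid mouth movement).
    """
    hints = []
    previous = frames[0].opening if frames else 0.0
    for frame in frames:
        speed = abs(frame.opening - previous)
        hints.append((frame.opening, speed > fast_move))
        previous = frame.opening
    return hints


if __name__ == "__main__":
    demo = [MouthFrame(0.0), MouthFrame(0.1), MouthFrame(0.6), MouthFrame(0.55)]
    for energy, consonant in visual_hints(demo):
        print(f"energy~{energy:.2f} consonant_hint={consonant}")
```

In this sketch only the jump from 0.1 to 0.6 exceeds the assumed threshold, so only the third frame carries a consonant hint. A real system would presumably use several mouth features rather than a single opening value.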

The invention is described with reference to a non-limiting typical embodiment as 
shown in the Fig. of the accompanying drawing. 

The figure shows a typical phone apparatus 1, such as a video telephone, having
inter alia a built-in microphone M and a built-in video camera C outputting signals
which correspond to the speech of a speaker S and the corresponding lips position
of the speaker's mouth. Those signals are converted to digital signals as shown by
A/D-converters 2 and 3, respectively, both shown by broken lines. It is to be
understood that A/D-converters 2, 3 may be incorporated into the microphone M
and/or the camera C, respectively, or those may be adapted to output digital
signals.

The signal thus corresponding to the lips position is input to an analyser 4
connected to a memory 5 storing typical algorithms of speech in association with
the position of the lips of a speaker. Thus the analyser determines which part of
the sound received by microphone M is coming from the speaker and which part
thereof is coming from the environment, that is, the environmental noise and voices
of surrounding persons. The result of the analysis done by analyser 4 and the
signal received by microphone M are input to a separator 6 which
distinguishes the speaker's voice from the surrounding or environmental sound and
thus improves the respective signal to interference ratio.
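
How a separator might combine the lip analysis with the microphone signal can be sketched as follows. This is an illustrative assumption, not the method of the disclosure: the sketch uses a simple broadband, Wiener-like gain, and the `separate` function, the per-frame `speaking` flag and the `floor` parameter are hypothetical names introduced here.

```python
from typing import List, Sequence


def separate(mic_frames: Sequence[Sequence[float]],
             speaking: Sequence[bool],
             floor: float = 0.1) -> List[List[float]]:
    """Attenuate microphone frames using a visual speech/no-speech flag.

    mic_frames: audio samples grouped into frames aligned with video frames.
    speaking:   per-frame flag from the lip analysis (True = lips indicate speech).
    floor:      minimum gain, so the output is attenuated rather than muted.
    """
    noise_power = 0.0  # running mean of the environmental sound power
    noise_frames = 0
    out = []
    for frame, is_speaking in zip(mic_frames, speaking):
        power = sum(s * s for s in frame) / max(len(frame), 1)
        if not is_speaking:
            # Lips are still: treat everything picked up as environmental sound.
            noise_frames += 1
            noise_power += (power - noise_power) / noise_frames
            gain = floor
        else:
            # Wiener-like gain: fraction of the power not explained by noise.
            gain = max(floor, 1.0 - noise_power / power) if power > 0 else floor
        out.append([gain * s for s in frame])
    return out
```

The design choice here is that the visual flag decides when the noise estimate may be updated, which is exactly the information an audio-only algorithm lacks when the interference consists of other voices. A practical separator would more likely apply such gains per frequency band.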

From the foregoing description it follows that the signal corresponding to the lips
position may be derived or processed both continuously and intermittently, wherein
[...] intermittently [...] from [...] result of analysis or separation.

The output signal of the separator 6, that is, a signal containing mostly only
signals corresponding to the speaker's voice and having reduced, if not cancelled,
sound of the environment, is input to a converter 7 which converts the signals to
transmittable signals according to various standards. The output of converter 7
is transmitted as symbolised by a transmission 8 shown in broken lines. The output
signal of the camera C, a video signal, may also be input to the converter 7 and
converted to transmittable signals as is typical with video telephone systems or
with video conference systems.

It should be noted expressly that apparatus 1 need not be a mobile or cellular
video telephone apparatus and that the invention is also applicable to any system
comprising a microphone and a camera.

Further, the apparatus 1 according to the invention may comprise a
system wherein a learning program symbolised by a block 9 communicates with the
memory 5 of the algorithm of the speech. Thus, the reduction of interfering noise
can be further enhanced. E.g. initially, that is, before using the apparatus 1 in
the environment, the speaker S speaks given sounds, that is, vowels and consonants,
after having switched on apparatus 1 in a very silent environment. Thus the analyser
"knows" the speaker's lip positions with predetermined consonants and vowels and
thus a better separation can be reached. This learning program can also be used at
the very beginning of a communication to be done.
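
The learning step can be sketched as follows, under the assumption that each prompted sound is recorded together with a measured mouth opening and audio energy. The function names, the triple format and the nearest-opening lookup are illustrative assumptions and are not taken from the disclosure.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Profile = Dict[str, Tuple[float, float]]


def learn_speaker_profile(samples: Iterable[Tuple[str, float, float]]) -> Profile:
    """Average the lip opening and audio energy observed for each prompted sound.

    samples: (sound_label, mouth_opening, audio_energy) triples recorded while
             the speaker repeats given vowels and consonants in a quiet room.
    Returns {sound_label: (mean_opening, mean_energy)}, a stand-in for the
    speaker-specific data held in the memory.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for label, opening, energy in samples:
        entry = sums[label]
        entry[0] += opening
        entry[1] += energy
        entry[2] += 1
    return {label: (o / n, e / n) for label, (o, e, n) in sums.items()}


def expected_energy(profile: Profile, opening: float) -> float:
    """Return the energy of the learned sound whose lip opening is closest."""
    _, energy = min(profile.values(), key=lambda pair: abs(pair[0] - opening))
    return energy
```

After such a calibration, an observed mouth opening can be turned into a speaker-specific energy estimate, for example `expected_energy(profile, 0.65)`, instead of relying on a generic mapping.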

Claims

1. Telecommunication system comprising [...] determined speech characteristics
[...] surrounding [...]

2. [...] the lips position of the speaker (S) is detected [...] intermittently,
speech characteristics of the speaker (S) are determined [...] lips position
determined by analysing the detected [...] characteristics and [...] on said
determined [...] the separated speaker's voice is converted to transmittable
signals.

3. The telecommunication system according to claim 1, wherein the lips position is
detected by means of a camera (C) viewing the speaker's face.

4. System and method according to claim 3 wherein the camera (C) is a built-in
camera of a video phone apparatus.

5. Method according to any one of claims 2 to 4 wherein the speech characteristics
are based on the opening of a speaker's mouth as an estimate of the speech
energy, rapid movement of the speaker's mouth as a hint to consonants and other
statistically detected characteristics of an association between position and
movement of the lips of a person and the thus output voice of the speaker.

6. Method according to claim 5 wherein initially or over time of use a learning
procedure is used to enhance the steps of determination and separation.

Abstract



There is provided a telecommunication system and method wherein not only the
speaker's voice but also the lips position of the speaker's mouth is detected (M, C)
and is analysed (4). Based thereon the speaker's voice can be separated (6) more
efficiently from environmental sound including both environmental noise and
surrounding persons' voices.



(Fig. 1)